Climate and Ground Water Trends in Columbus, Ohio¶

1. Introduction¶

Context & Importance

This report analyzes climate trends and groundwater levels in Columbus, OH, using data obtained from the NOAA and USGS websites. Columbus has a humid continental climate (Köppen Dfa), characterized by four distinct seasons:

  • Summers are typically warm to hot and humid, with average high temperatures in July reaching the mid-80s °F (around 30 °C).

  • Winters are cold with average high temperatures in January around the upper 30s °F (about 3−4 °C) and lows in the low to mid-20s °F (around −5 °C). The city receives moderate snowfall.

  • Spring and Autumn are generally mild transitional seasons.

  • Precipitation is moderate and relatively consistent throughout the year, with a slight peak in the late spring and summer.

Research Questions (RQ)

  1. How have annual and seasonal average air temperatures changed from 1946 to 2014?

  2. Has the frequency of extreme temperature and precipitation events changed over this period?

  3. How do temperature anomalies relate to precipitation patterns in Columbus?

  4. Is there a trend in groundwater levels over time?

  5. Are there seasonal patterns in groundwater levels?

  6. Are extreme highs or lows becoming more common?

Data Description

  • Daily climate observations from 1946–2015.

  • Variables: date, minimum/maximum temperature, temperature at time of observation, precipitation.

  • Daily ground water level observations from 2005–2025.

  • Variables: date, groundwater level.

Data Preprocessing

  • Handling missing data (imputation or exclusion).

  • Aggregating daily data into monthly, seasonal, annual summaries.
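The aggregation step above can be sketched with pandas. This is a minimal illustration on toy data (the real notebook reads the NOAA CSV; the column names here mirror it), not the exact code used later:

```python
import pandas as pd

# Toy daily data standing in for the NOAA CSV (assumed column names)
df = pd.DataFrame({
    'DATE': pd.date_range('1946-01-01', periods=730, freq='D'),
    'TMAX': 55.0,
    'PRCP': 0.1,
})

# Monthly summaries: mean max temperature, total precipitation
monthly = df.groupby(df['DATE'].dt.to_period('M')).agg(
    avg_tmax=('TMAX', 'mean'), total_prcp=('PRCP', 'sum'))

# Annual summaries, mirroring the groupby('year') pattern used below
annual = df.groupby(df['DATE'].dt.year).agg(
    avg_tmax=('TMAX', 'mean'), total_prcp=('PRCP', 'sum'))

print(annual)
```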

Statistical Tools & Visualizations

  • Trend analysis: linear regression, Mann–Kendall test for monotonic trends.

  • Comparative analysis: decadal averages.

  • Visualization: line charts, bar charts, heatmaps.

  • Correlation/regression for temp–precip relationship.

2. Climate Data Preprocessing¶

*Note: I discovered that the data stop at the end of January 2015, meaning 2015 is missing 11 months of observations. To avoid skewing the annual statistics, I excluded January 2015, restricting the analysis to 1946–2014.¶
In [1]:
# libraries
import pymannkendall as mk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from scipy.stats import linregress, f_oneway, ks_2samp, pearsonr, spearmanr, kruskal
In [2]:
# import data; change DATE column to datetime
climate_df = pd.read_csv('climate_data_columbus_oh.csv')
climate_df['DATE'] = pd.to_datetime(climate_df['DATE'])
climate_df = climate_df[climate_df['DATE'].dt.year <= 2014]
In [3]:
print("The size of the data: ", climate_df.shape[0], "rows and", climate_df.shape[1], "columns.")
The size of the data:  24445 rows and 7 columns.
In [4]:
print(climate_df.isnull().sum())
STATION      0
NAME         0
DATE         0
PRCP       417
TMAX       360
TMIN       420
TOBS       353
dtype: int64

The numeric variables contain a small number of missing values.

In [5]:
print(climate_df.isnull().mean() * 100)
STATION    0.000000
NAME       0.000000
DATE       0.000000
PRCP       1.705870
TMAX       1.472694
TMIN       1.718143
TOBS       1.444058
dtype: float64

The output gives the percentage of missing values per column. A common rule of thumb: if a column is under ~5% missing, it is usually safe to impute or drop the affected rows; much higher, and more caution is needed. All numeric columns in our dataset are below 5%.
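As a hedged sketch (not the notebook's actual cleaning code), the under-5% rule could be mechanized like this: drop rows with missing values only in lightly-missing columns, and flag columns that exceed the cutoff for separate handling:

```python
import numpy as np
import pandas as pd

def apply_missing_rule(df, threshold=5.0):
    """Drop rows with NaNs in lightly-missing columns; flag heavy ones."""
    pct = df.isnull().mean() * 100
    heavy = pct[pct > threshold].index.tolist()
    if heavy:
        print(f"Columns over {threshold}% missing (handle separately): {heavy}")
    # Only drop rows based on columns under the threshold
    return df.dropna(subset=pct[pct <= threshold].index.tolist())

# Toy frame: 'a' is 1% missing (rows dropped), 'b' is 10% missing (flagged)
toy = pd.DataFrame({'a': np.r_[np.nan, np.ones(99)],
                    'b': np.r_[np.full(10, np.nan), np.ones(90)]})
clean = apply_missing_rule(toy)
print(clean.shape)
```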

In [6]:
climate_df.dropna(inplace = True)
print("The size of the data after dropping missing values: ", climate_df.shape[0], "rows and", climate_df.shape[1], "columns.")
The size of the data after dropping missing values:  23751 rows and 7 columns.
In [7]:
climate_df.head()
Out[7]:
STATION NAME DATE PRCP TMAX TMIN TOBS
0 USC00331783 COLUMBUS VLY CROSSING, OH US 1946-01-01 0.02 27.0 15.0 23.0
1 USC00331783 COLUMBUS VLY CROSSING, OH US 1946-01-02 0.00 30.0 10.0 28.0
2 USC00331783 COLUMBUS VLY CROSSING, OH US 1946-01-03 0.00 37.0 25.0 37.0
3 USC00331783 COLUMBUS VLY CROSSING, OH US 1946-01-04 0.00 45.0 36.0 44.0
4 USC00331783 COLUMBUS VLY CROSSING, OH US 1946-01-05 0.00 60.0 36.0 59.0

The first five rows confirm that all columns have appropriate types. The data is ready for further analysis.

3. Visual Analysis of Climate Data¶

RQ1: Long-Term Temperature Trends¶

Annual Average Temperature Trend¶
In [8]:
climate_df['year'] = climate_df['DATE'].dt.year

# Annual averages
annual_temps = climate_df.groupby('year')[['TMIN', 'TMAX']].mean()

# Calculate the mean of TMIN and TMAX
mean_tmin = annual_temps['TMIN'].mean()
mean_tmax = annual_temps['TMAX'].mean()


fig = go.Figure()

# Add min temp line
fig.add_trace(go.Scatter(
    x=annual_temps.index,
    y=annual_temps['TMIN'],
    mode='lines',
    name='Avg Min Temp',
    line=dict(color='steelblue')
))

# Add max temp line
fig.add_trace(go.Scatter(
    x=annual_temps.index,
    y=annual_temps['TMAX'],
    mode='lines',
    name='Avg Max Temp',
    line=dict(color='tomato')
))

# Add mean min temp line
fig.add_trace(go.Scatter(
    x=annual_temps.index,
    y=[mean_tmin]*len(annual_temps),
    mode='lines',
    name=f'Long Term Avg Min Temp ({mean_tmin:.2f}°F)',
    line=dict(color='black', dash='dash')
))

# Add mean max temp line
fig.add_trace(go.Scatter(
    x=annual_temps.index,
    y=[mean_tmax]*len(annual_temps),
    mode='lines',
    name=f'Long Term Avg Max Temp ({mean_tmax:.2f}°F)',
    line=dict(color='black', dash='dash')
))

# Layout
fig.update_layout(
    title="Annual Average Min and Max Temperatures (Columbus, 1946–2014)",
    xaxis_title="Year",
    yaxis_title="Temperature (°F)",
    template="simple_white",
    legend=dict(bordercolor="gray", borderwidth=0.5),
    width=900,
    height=500
)

fig.show()

Both trend lines oscillate closely around their respective long-term means: short-term fluctuations are evident, but the overall annual averages are stable, with no uniform warming visible across the record.

Seasonal Trends (Winter vs. Summer)¶
In [9]:
climate_df['month'] = climate_df['DATE'].dt.month
climate_df['season'] = pd.cut(climate_df['month'], bins=[0,2,5,8,11,12], 
                      labels=['Winter','Spring','Summer','Fall','Winter'], right=True, ordered=False)


seasonal_temps = climate_df.groupby(['year','season'])[['TMAX','TMIN']].mean().reset_index()

# --- Calculate the overall average temperature for each season/year combination ---
seasonal_temps['TAVG'] = (seasonal_temps['TMAX'] + seasonal_temps['TMIN']) / 2


# --- Calculate the Overall Mean TAVG for Winter and Summer ---

# Overall mean AVERAGE temperature for Winter across all years
mean_winter_tavg = seasonal_temps[seasonal_temps['season'] == 'Winter']['TAVG'].mean()

# Overall mean AVERAGE temperature for Summer across all years
mean_summer_tavg = seasonal_temps[seasonal_temps['season'] == 'Summer']['TAVG'].mean()

# Dictionary to hold the means for easy lookup in the loop
mean_temps_avg = {
    'Winter': mean_winter_tavg,
    'Summer': mean_summer_tavg
}

# Define colors for seasons
colors = {'Winter': 'steelblue', 'Summer': 'tomato'}

fig = go.Figure()

for season in ['Winter', 'Summer']:
    subset = seasonal_temps[seasonal_temps['season'] == season]
    color = colors[season]
    mean_val = mean_temps_avg[season]

    # Plot seasonal average temperature
    fig.add_trace(go.Scatter(
        x=subset['year'],
        y=subset['TAVG'],
        mode='lines',
        name=f'{season} Avg Temp',
        line=dict(color=color, width=2)
    ))

    # Plot overall mean line
    fig.add_trace(go.Scatter(
        x=subset['year'],
        y=[mean_val]*len(subset),
        mode='lines',
        name=f'Overall Mean {season} TAVG ({mean_val:.2f}°F)',
        line=dict(color=color, dash='dash'),
    ))

# Layout
fig.update_layout(
    title='Annual Average Winter and Summer Temperatures (Columbus, 1946–2014)',
    xaxis_title='Year',
    yaxis_title='Temperature (°F)',
    template='simple_white',
    width=900,
    height=500,
    legend=dict(bordercolor="gray", borderwidth=0.5)
)

fig.show()

Winter temperatures exhibit much greater inter-annual variability (larger swings above and below the mean) than summer temperatures. This is typical, as winter weather is more heavily influenced by large-scale, rapidly changing atmospheric patterns (cold fronts, arctic air masses).

Summer temperatures are highly stable, remaining close to the 72°F mean for most of the period, reflecting the stabilizing effect of consistent solar radiation and local climate factors during the warmest months.

RQ2: Extreme Events¶

Frequency of Hot & Cold Days per Decade¶
In [10]:
climate_df['decade'] = (climate_df['year']//10)*10

extremes = climate_df.groupby('decade').agg(
    hot_days = ('TMAX', lambda x: (x>90).sum()),
    cold_days = ('TMIN', lambda x: (x<0).sum())
)

# Create the bar chart
fig = go.Figure()

# Hot days
fig.add_trace(go.Bar(
    x=extremes.index,
    y=extremes['hot_days'],
    name='Hot Days (>90°F)',
    marker_color='tomato'
))

# Cold days
fig.add_trace(go.Bar(
    x=extremes.index,
    y=extremes['cold_days'],
    name='Cold Days (<0°F)',
    marker_color='steelblue'
))

# Layout
fig.update_layout(
    barmode='group',  # side-by-side bars
    title='Extreme Temperature Days per Decade (Columbus, 1946–2014)',
    xaxis_title='Decade',
    yaxis_title='Number of Days',
    template='simple_white',
    width=900,
    height=500,
    legend=dict(bordercolor="gray", borderwidth=1)
)

fig.show()

The plot illustrates an asymmetric change in climate extremes: the risk associated with extreme cold has dropped sharply, while extreme heat has generally increased and remained high through the late 20th and early 21st centuries. Note that the first and last bars cover partial decades (1946–1949 and 2010–2014), so their counts understate full-decade frequencies.

Frequency of Heavy Rainfall Days (>1 inch/day)¶
In [11]:
rain_extremes = climate_df[climate_df['PRCP'] > 1].groupby('decade').size()

# Create the bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    x=rain_extremes.index,
    y=rain_extremes.values,
    name='Heavy Rainfall Days (>1 inch/day)',
    marker_color='CornflowerBlue'
))

# Layout
fig.update_layout(
    title='Heavy Rainfall Days per Decade (>1 inch/day)',
    xaxis_title='Decade',
    yaxis_title='Number of Days',
    template='simple_white',
    width=900,
    height=500
)

fig.show()

The plot shows an increasing trend in the frequency of heavy rainfall events over the period, peaking around the turn of the century. The apparent sharp drop in the 2010s largely reflects an incomplete decade, since the data run only through 2014.

RQ3: Temperature–Precipitation Relationships¶

Scatterplot: Summer Temp vs Summer Precipitation¶
In [12]:
# Filter for summer (already done)
summer_daily = climate_df[climate_df['season'] == "Summer"].copy()

# Add small offset to avoid log(0)
PRCP_OFFSET = 0.01
summer_daily['PRCP_offset'] = summer_daily['PRCP'] + PRCP_OFFSET

# Create scatter plot with color gradient based on precipitation
fig = px.scatter(
    summer_daily,
    x='TMAX',
    y='PRCP_offset',
    color='PRCP',  # original precipitation for color intensity
    color_continuous_scale='Teal',  # blue-green gradient
    opacity=0.6,
    hover_data={'TMAX': True, 'PRCP': True, 'PRCP_offset': False},
    labels={'TMAX': 'Daily Max Temp (°F)', 'PRCP_offset': 'Daily Precipitation (inches + 0.01)'},
    title='Daily Max Temperature vs Precipitation in Summer (Columbus, 1946–2014)'
)

# Add simple rolling mean trend line
sorted_data = summer_daily.sort_values('TMAX')
tmax = sorted_data['TMAX']
trend = sorted_data['PRCP_offset'].rolling(30, min_periods=1, center=True).mean()

fig.add_traces(go.Scatter(
    x=tmax,
    y=trend,
    mode='lines',
    line=dict(color='tomato', width=2),
    showlegend=False
))

# Log y-axis
fig.update_yaxes(type='log', title_text='Daily Precipitation (inches + 0.01)')

fig.update_layout(
    width=900,
    height=600,
    template='simple_white',
    coloraxis_colorbar=dict(title="Precipitation (inches)")
)

fig.show()

The plot illustrates the relationship between Daily Maximum Temperature (TMAX) and Daily Precipitation (PRCP) during summer months in Columbus (1946–2014). Each point represents a day, and precipitation values are shown on a logarithmic scale (with a 0.01-inch offset added to handle zero values). This transformation allows both frequent light rainfall events and rare heavy rainfall extremes to be visualized on the same axis.
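The need for the offset is easy to demonstrate: the log of zero is undefined, so dry days (PRCP = 0) would vanish from a log axis without it. A quick check:

```python
import numpy as np

prcp = np.array([0.0, 0.05, 1.2])  # a dry day, a light shower, a downpour

with np.errstate(divide='ignore'):
    print(np.log10(prcp))      # the dry day maps to -inf

print(np.log10(prcp + 0.01))   # all values finite after the 0.01 offset
```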

The red line represents a rolling-mean smoothing curve, highlighting the central trend between temperature and precipitation.

Key Patterns Observed:

Nonlinear Relationship

The overall pattern is clearly nonlinear. The smoothed trend line and the scatter distribution show that precipitation does not increase or decrease in a simple, straight-line fashion with temperature.

Low to Moderate Temperatures (10°F–55°F):

The trend line is relatively flat and elevated near 0.01 inches (the offset floor), suggesting a higher average probability of rainfall compared to hotter days. While many dry days exist, light precipitation events are more common in this cooler range.

Moderate to Warm Temperatures (60°F–85°F):

The trend line declines as temperatures rise, indicating that typical summer highs correspond with fewer average rainfall days. The log scaling makes it clear that while most days are dry, occasional moderate rainfall occurs.

High Temperatures (85°F–105°F):

The trend line approaches the offset baseline, meaning nearly all of these hottest summer days are dry. The scattered nonzero points in this range represent rare rain events, often extreme, but they are statistical outliers.

Precipitation Extremes:

The largest rainfall events (exceeding 1 inch/day) cluster in the mid-temperature range (65–90°F), consistent with conditions favorable for intense summer thunderstorms—warm enough to provide convective energy, but not so hot as to signal drought conditions.

In summary, the log-scaled plot shows that cooler summer days have a higher likelihood of light to moderate rainfall, mid-range temperatures host the most intense storm events, and the hottest summer days are overwhelmingly dry.

Correlation Heatmap (between average min, max temperatures and total precipitation)¶
In [13]:
annual = climate_df.groupby('year').agg(
    avg_tmax=('TMAX','mean'),
    avg_tmin=('TMIN','mean'),
    total_precip=('PRCP','sum')
)

# Compute correlation matrix
corr_matrix = annual.corr()

# Heatmap
fig = px.imshow(
    corr_matrix,
    text_auto=True,  # show correlation values on cells
    color_continuous_scale='RdBu_r',  # similar to 'coolwarm'
    origin='upper',
    labels=dict(x="Climate Variable", y="Climate Variable", color="Correlation")
)

fig.update_layout(
    title='Correlation Between Annual Climate Variables',
    width=700,
    height=600
)

fig.show()

The key interpretation is that temperatures are strongly correlated with each other, but only weakly correlated with precipitation. In simple terms: knowing how hot or cold a year was (max or min temp) tells us very little about how much total precipitation that year received.

4. Statistical Significance of Climate Trends¶

RQ1: Long-Term Temperature Trends¶

Linear Regression (warming trend over years)¶
In [14]:
# Annual mean temperature
annual_mean = climate_df.groupby('year')[['TMAX','TMIN']].mean().reset_index()

# Test trend in max temps
slope, intercept, r_value, p_value, std_err = linregress(annual_mean['year'], annual_mean['TMAX'])
print(f"Max Temp Trend: slope={slope:.3f} °F/year, p={p_value:.4f}, R²={r_value**2:.3f}")

# Test trend in min temps
slope, intercept, r_value, p_value, std_err = linregress(annual_mean['year'], annual_mean['TMIN'])
print(f"Min Temp Trend: slope={slope:.3f} °F/year, p={p_value:.4f}, R²={r_value**2:.3f}")
Max Temp Trend: slope=-0.001 °F/year, p=0.9385, R²=0.000
Min Temp Trend: slope=-0.020 °F/year, p=0.0598, R²=0.052

Days Are Not Warming (Max Temp): The trend is flat (slope ≈ 0), and the high p-value (0.9385) indicates that any small change is consistent with random year-to-year variation rather than a long-term trend.

Nights Show a Slight Cooling Signal (Min Temp): The average minimum temperature slope is −0.020 °F/year, which accumulates to roughly −1.4 °F over the 68-year record. The p-value (0.0598) sits just above the conventional 0.05 threshold, so this trend is suggestive but not statistically significant.

Diurnal Temperature Range: With minimum temperatures drifting down while maximums stay flat, the gap between daily highs and lows (the diurnal temperature range) would be widening slightly, though the evidence for this is weak.
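As a sanity check, the cumulative change implied by a per-year slope is simple arithmetic, using the TMIN slope from the regression output above:

```python
slope = -0.020        # °F per year, TMIN slope from the regression output
span = 2014 - 1946    # 68 years in the record
total = slope * span
print(f"{total:.2f} °F over {span} years")  # -1.36 °F over 68 years
```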

Mann–Kendall Non-Parametric Test (trend robustness)¶

The Mann–Kendall Non-Parametric Test is a statistical test used to assess whether there is a monotonic trend (consistently increasing or decreasing pattern) in a time series dataset. It's often referred to as a "trend robustness" test because it makes very few assumptions about the underlying distribution of the data, making it robust to non-normal data and outliers.
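As a hedged illustration of what the library computes, the Mann–Kendall S statistic is just the sum of signs over all ordered pairs of observations. This toy function is for intuition only, not a replacement for `mk.original_test`, which also computes the variance, Tau, and p-value:

```python
import numpy as np

def mann_kendall_s(x):
    """Sum of sign(x[j] - x[i]) over all pairs i < j."""
    x = np.asarray(x, dtype=float)
    s = 0
    for i in range(len(x) - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()
    return int(s)

print(mann_kendall_s([1, 2, 3, 4]))  # 6: every pair increases
print(mann_kendall_s([4, 3, 2, 1]))  # -6: every pair decreases
```

A large positive (or negative) S relative to its variance signals a monotonic upward (or downward) trend; values near zero, as in the outputs below, indicate no consistent trend.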

In [15]:
mk_test_max = mk.original_test(annual_mean['TMAX'])
mk_test_min = mk.original_test(annual_mean['TMIN'])

print("Mann-Kendall Max Temp s-statistic and p-value:", mk_test_max.s, ",", round(mk_test_max.p, 2))
print("Mann-Kendall Min Temp s-statistic and p-value:", mk_test_min.s,",", round(mk_test_min.p, 2))
Mann-Kendall Max Temp s-statistic and p-value: 82.0 , 0.67
Mann-Kendall Min Temp s-statistic and p-value: -134.0 , 0.49

The Mann–Kendall trend test indicates no statistically significant long-term monotonic trend in maximum or minimum daily temperatures in Columbus, OH, from 1946 to 2014 (p > 0.05 for both). Although the minimum temperature series shows a weak negative S statistic, this trend is not significant.

RQ2: Extreme Events¶

Change in Frequency of Hot & Cold Days¶
In [16]:
# Hot days >90°F, cold days <0°F
hot_days = climate_df.groupby('year')['TMAX'].apply(lambda x: (x>90).sum())
cold_days = climate_df.groupby('year')['TMIN'].apply(lambda x: (x<0).sum())

# Regression test for hot days
slope, intercept, r_value, p_value, std_err = linregress(hot_days.index, hot_days.values)
print(f"Hot Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")

# Regression test for cold days
slope, intercept, r_value, p_value, std_err = linregress(cold_days.index, cold_days.values)
print(f"Cold Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")
Hot Days Trend: slope=0.08 days/year, p=0.1272
Cold Days Trend: slope=0.00 days/year, p=0.8740

Trend analysis of temperature extremes in Columbus, OH (1946–2014) showed no statistically significant change in the frequency of hot or cold days. Hot days exhibited a slight positive trend (+0.08 days/year, p = 0.1272), while cold days showed no change (+0.00 days/year, p = 0.8740). These results suggest that the observed variations are consistent with natural variability rather than a persistent long-term trend.

Change in Heavy Rainfall Days¶
In [17]:
heavy_rain = climate_df.groupby('year')['PRCP'].apply(lambda x: (x>1).sum())

slope, intercept, r_value, p_value, std_err = linregress(heavy_rain.index, heavy_rain.values)
print(f"Heavy Rainfall Days Trend: slope={slope:.2f} days/year, p={p_value:.4f}")
Heavy Rainfall Days Trend: slope=0.05 days/year, p=0.0018

The test indicates a statistically significant increase in heavy rainfall days in Columbus, OH, from 1946 to 2014 (slope = +0.05 days/year, p = 0.0018). This corresponds to an increase of about 3.4 heavy rainfall days per year over the 68-year record, suggesting a shift toward more frequent extreme precipitation events.

Correlation Between Summer Temp & Precip¶
In [18]:
print("---------------------------------------------------------------------------------")
print("Pearson & Spearman Correlation Statistics for Summer Max Temp and Precipitation:")
pearson_r, pearson_p = pearsonr(summer_daily['TMAX'], summer_daily['PRCP'])
print(f"Pearson r={pearson_r:.3f}, p={pearson_p:.4f}")
spearman_r, spearman_p = spearmanr(summer_daily['TMAX'], summer_daily['PRCP'])
print(f"Spearman ρ={spearman_r:.3f}, p={spearman_p:.4f}")

print("---------------------------------------------------------------------------------")
print("Pearson & Spearman Correlation Statistics for Summer Min Temp and Precipitation:")
pearson_r, pearson_p = pearsonr(summer_daily['TMIN'], summer_daily['PRCP'])
print(f"Pearson r={pearson_r:.3f}, p={pearson_p:.4f}")
spearman_r, spearman_p = spearmanr(summer_daily['TMIN'], summer_daily['PRCP'])
print(f"Spearman ρ={spearman_r:.3f}, p={spearman_p:.4f}")
print("---------------------------------------------------------------------------------")
---------------------------------------------------------------------------------
Pearson & Spearman Correlation Statistics for Summer Max Temp and Precipitation:
Pearson r=-0.058, p=0.0000
Spearman ρ=-0.097, p=0.0000
---------------------------------------------------------------------------------
Pearson & Spearman Correlation Statistics for Summer Min Temp and Precipitation:
Pearson r=0.186, p=0.0000
Spearman ρ=0.277, p=0.0000
---------------------------------------------------------------------------------

Correlation analysis shows statistically significant but weak associations between summer temperatures and daily precipitation in Columbus, OH (1946–2014). For maximum temperatures, Pearson’s correlation (r = -0.058, p < 0.001) and Spearman’s rank correlation (ρ = -0.097, p < 0.001) both indicate a small negative relationship: hotter summer days are slightly more likely to be dry. In contrast, for minimum temperatures, Pearson’s correlation (r = 0.186, p < 0.001) and Spearman’s rank correlation (ρ = 0.277, p < 0.001) reveal a modest positive relationship: warmer nights are associated with higher daily precipitation.

Overall, these results suggest that while hot summer days often coincide with dryness, warmer nighttime conditions are more conducive to rainfall.

5. Analysis of Groundwater Level Data¶

5(a). Preprocessing.¶

In [19]:
import dataretrieval.nwis as nwis

# USGS site code for ground water levels in Columbus OH.
site = '400540082540400'

# retrieve data
gwl_df = nwis.get_record(sites=site, service='dv', start='2005-10-06', end='2025-09-26', parameterCd='72019')

# reset index and set date to datetime
gwl_df = gwl_df.reset_index()
gwl_df['datetime'] = pd.to_datetime(gwl_df['datetime'])

# rename needed columns 
rename_dict = {'datetime': 'Date', '72019_Maximum': 'Water_Level_ft'}
gwl_df = gwl_df.rename(columns=rename_dict)

# columns to retain
columns_to_keep = ['Date', 'Water_Level_ft']
gwl_df = gwl_df[columns_to_keep]
In [20]:
print("The count of missing values for date and ground water levels:")
gwl_df.isna().sum()
The count of missing values for date and ground water levels:
Out[20]:
Date              0
Water_Level_ft    0
dtype: int64
In [21]:
print("The data types:")
gwl_df.dtypes
The data types:
Out[21]:
Date              datetime64[ns, UTC]
Water_Level_ft                float64
dtype: object
In [22]:
print("The dataset has", gwl_df.shape[0], "rows and", gwl_df.shape[1], "columns.")
The dataset has 7268 rows and 2 columns.
In [23]:
print("First 5 rows of data:")
gwl_df.head()
First 5 rows of data:
Out[23]:
Date Water_Level_ft
0 2005-10-06 00:00:00+00:00 21.61
1 2005-10-07 00:00:00+00:00 21.62
2 2005-10-08 00:00:00+00:00 21.59
3 2005-10-09 00:00:00+00:00 21.64
4 2005-10-10 00:00:00+00:00 21.68

5(b). Visual Trends¶

RQ4. Is there a trend in groundwater levels over time?¶

Long-term Trend and Mann-Kendall test

In [24]:
result = mk.original_test(gwl_df['Water_Level_ft'])

# Scatter for original groundwater levels
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=gwl_df['Date'],
    y=gwl_df['Water_Level_ft'],
    mode='lines',
    name='Groundwater Levels',
    line=dict(color='steelblue', width=2),
    #opacity=0.7
))

# Rolling mean (30-day)
rolling_mean = gwl_df['Water_Level_ft'].rolling(30).mean()
fig.add_trace(go.Scatter(
    x=gwl_df['Date'],
    y=rolling_mean,
    mode='lines',
    name='30-day Rolling Mean',
    line=dict(color='tomato', width=2)
))

# Add annotation for Mann-Kendall test result
mk_text = f"Trend: {result.trend}<br>Tau={result.Tau:.2f}, p={result.p:.3f}"
fig.add_annotation(
    xref='paper', yref='paper',
    x=0.5, y=1,
    text=mk_text,
    showarrow=False,
    bordercolor="black",
    borderwidth=1,
    bgcolor="white",
    #opacity=0.7
)

# Layout
fig.update_layout(
    title="Groundwater Levels Over Time",
    xaxis_title="Year",
    yaxis_title="Groundwater Level (ft)",
    template='simple_white',
    width=900,
    height=500,
    legend=dict(bordercolor="gray", borderwidth=0.5)
)

fig.show()

The groundwater levels exhibit a strong, predictable seasonal pattern superimposed on a statistically significant but slight long-term decreasing trend over the observed period (2005–2025). The oscillation pattern sits slightly lower between 2012 and 2020, consistent with lower groundwater levels during those years.

RQ5. Are there seasonal patterns in groundwater levels?¶

Distributions of ground water levels and One-Way ANOVA Test

In [25]:
gwl_df['month'] = gwl_df['Date'].dt.month

groups = [gwl_df[gwl_df['month'] == m]['Water_Level_ft'] for m in gwl_df['month'].unique()]
anova_result = f_oneway(*groups)

# Boxplot of Water Level by Month
fig = px.box(
    gwl_df,
    x='month',
    y='Water_Level_ft',
    color='month',
    #color_continuous_scale='RdBu',  # similar to "coolwarm"
    labels={'month': 'Month', 'Water_Level_ft': 'Groundwater Level (ft)'},
    title='Seasonal Variation in Groundwater Levels'
)

# Remove legend for colors if desired
fig.update_layout(showlegend=False)

# Add ANOVA annotation
anova_text = f"ANOVA F={anova_result.statistic:.2f}, p={anova_result.pvalue:.3f}"
fig.add_annotation(
    xref='paper', yref='paper',
    x=0.02, y=1,
    text=anova_text,
    showarrow=False,
    bordercolor="black",
    borderwidth=1,
    bgcolor="white",
    #opacity=0.7,
    font=dict(size=12)
)

fig.update_layout(
    width=900,
    height=500,
    template='simple_white',
    xaxis=dict(title='Month'),
    yaxis=dict(title='Groundwater Level (ft)')
)

fig.show()

The plot clearly demonstrates a highly significant seasonal cycle in groundwater levels. Levels consistently peak in the late fall/early winter (October-December) and reach their minimum in the early spring (March-May). This pattern is typical in many regions where recharge is dominant in cooler, wetter seasons and drawdown increases during the growing season.

The ANOVA result indicates that the differences in groundwater levels across the 12 months of the year are highly statistically significant. The null hypothesis—that the mean groundwater level is the same for all months—is strongly rejected. This confirms that there is a definite and measurable seasonal pattern.

RQ6. Are extreme highs or lows becoming more common?¶
In [26]:
# Extract year
gwl_df['year'] = gwl_df['Date'].dt.year

# Select last 3 years
last_years = sorted(gwl_df['year'].unique())[-3:]
subset = gwl_df[gwl_df['year'].isin(last_years)]

# Prepare data for each year
data = [subset[subset['year'] == y]['Water_Level_ft'] for y in last_years]

# Create KDE plots
fig = ff.create_distplot(
    data,
    group_labels=[str(y) for y in last_years],
    show_hist=False,    # no histogram bars
    show_rug=False,
    colors=['#66c2a5', '#fc8d62', '#8da0cb'],  # Set2 palette
    curve_type='kde'
)

fig.update_traces(line=dict(width=3))

# Layout
fig.update_layout(
    title="Distribution of Groundwater Levels by Year",
    xaxis_title="Groundwater Level (ft)",
    yaxis_title="Density",
    width=900,
    height=500,
    template='simple_white'
)

fig.show()

The plot demonstrates a shift in the distribution of groundwater levels:

  • 2024 was a year where levels were most frequently high (concentrated at 22.5 ft).

  • 2023 had a more balanced frequency between high and mid-range levels.

  • 2025 (for the data available) shows a strong central tendency with levels primarily clustered in the mid-range (≈21.2 ft) and a lack of the high-level events seen in the previous two years. This is consistent with a year that has generally lower groundwater levels compared to 2024.

In [27]:
# group data by year
groups = [g['Water_Level_ft'].values for _, g in gwl_df.groupby('year')]

# Kruskal-Wallis test
kw_result = kruskal(*groups)
print("Kruskal-Wallis H-stat:", round(kw_result.statistic,2), "p-value:", kw_result.pvalue)
Kruskal-Wallis H-stat: 2047.56 p-value: 0.0

This result aligns with the prior analysis (especially the KDE plot comparing 2023, 2024, and 2025) which showed a clear shift in the distribution of levels between years. The Kruskal-Wallis test confirms that these visual differences are statistically robust, meaning the average water conditions change significantly from one year to the next in terms of overall level.